High-quality bilingual subtitle document alignments with application to spontaneous speech translation
Authors
Abstract
In this paper, we investigate the task of translating spontaneous speech transcriptions by using aligned movie subtitles to train a statistical machine translation (SMT) system. In contrast to lexical dynamic time warping (DTW) approaches to bilingual subtitle alignment, we align subtitle documents using time-stamps. We show that subtitle time-stamps in two languages are often approximately linearly related, which can be exploited to extract high-quality bilingual subtitle pairs. On a small tagged data set, we achieve an improvement of 0.21 F-score points over the traditional DTW alignment approach and 0.39 F-score points over a simple line-fitting approach. In addition, in spontaneous speech translation experiments, training on subtitle data aligned with the proposed approach yields a gain of 4.88 BLEU points over training on data aligned with the DTW-based approach, demonstrating the merit of the time-stamp-based subtitle alignment scheme. © 2011 Elsevier Ltd. All rights reserved.
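The approximately linear relation between time-stamps mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's proposed method, whose details the abstract does not give; it is closer to the simple line-fitting baseline that the abstract reports being outperformed. The function names `fit_line` and `align_by_time`, the tolerance `tol`, and the assumption that a few trusted anchor pairs of start times (in seconds) are already known are all hypothetical choices made for illustration.

```python
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y ~ a * x + b (hypothetical helper)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two anchor pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("anchor time-stamps must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def align_by_time(
    src_starts: List[float],
    tgt_starts: List[float],
    a: float,
    b: float,
    tol: float = 1.0,
) -> List[Tuple[int, int]]:
    """Pair each source line with the nearest target line in mapped time.

    A source start time s is mapped to a * s + b; the pair is kept only if
    the nearest target start time lies within `tol` seconds (assumed value).
    """
    pairs: List[Tuple[int, int]] = []
    if not tgt_starts:
        return pairs
    for i, s in enumerate(src_starts):
        predicted = a * s + b
        j = min(range(len(tgt_starts)), key=lambda k: abs(tgt_starts[k] - predicted))
        if abs(tgt_starts[j] - predicted) <= tol:
            pairs.append((i, j))
    return pairs


# Illustrative data only: the target track runs 4% slower and starts 2 s late.
src = [10.0, 20.0, 30.0, 40.0]
anchors_tgt = [1.04 * s + 2.0 for s in src]
a, b = fit_line(src, anchors_tgt)

# The target file has one extra line (at 18.0 s) with no source counterpart.
tgt = [12.4, 18.0, 22.8, 33.2, 43.6]
pairs = align_by_time(src, tgt, a, b)
```

In this toy setting the slope captures a frame-rate difference between the two subtitle files and the intercept captures a constant offset; lines with no counterpart within the tolerance are simply left unpaired.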
Related resources
JESC: Japanese-English Subtitle Corpus
In this paper we describe the Japanese-English Subtitle Corpus (JESC). JESC is a large Japanese-English parallel corpus covering the underrepresented domain of conversational dialogue. It consists of more than 3.2 million examples, making it the largest freely available dataset of its kind. The corpus was assembled by crawling and aligning subtitles found on the web. The assembly process incorp...
Context-driven automatic bilingual movie subtitle alignment
Movie subtitle alignment is a potentially useful approach for deriving automatically parallel bilingual/multilingual spoken language data for automatic speech translation. In this paper, we consider the movie subtitle alignment task. We propose a distance metric between utterances of different languages based on lexical features derived from bilingual dictionaries. We use the dynamic time warpi...
The ADAPT Bilingual Document Alignment system at WMT16
Comparable corpora have been shown to be useful in several multilingual natural language processing (NLP) tasks. Many previous papers have focused on how to improve the extraction of parallel data from this kind of corpus on different levels. In this paper, we are interested in improving the quality of bilingual comparable corpora according to increased document alignment score. We describe our...
Evaluation of alternatives on speech to sign language translation
This paper evaluates different approaches on speech to sign language machine translation. The framework of the application focuses on assisting deaf people to apply for the passport or related information. In this context, the main aim is to automatically translate the spontaneous speech, uttered by an officer, into Spanish Sign Language (SSL). In order to get the best translation quality, thre...
Maximizing Component Quality in Bilingual Word-Aligned Segmentations
Given a pair of source and target language sentences which are translations of each other with known word alignments between them, we extract bilingual phrase-level segmentations of such a pair. This is done by identifying two appropriate measures that assess the quality of phrase segments, one on the monolingual level for both language sides, and one on the bilingual level. The monolingual mea...
Journal: Computer Speech & Language
Volume: 27
Publication date: 2013